Dealing with Sparse Document and Topic Representations: Lab Report for CHiC 2012

Authors

  • Philipp Schaer
  • Daniel Hienert
  • Frank Sawitzki
  • Andias Wira-Alam
  • Thomas Lüke
Abstract

We report on the participation of GESIS in the first CHiC workshop (Cultural Heritage in CLEF). Because the workshop was held for the first time, no prior experience existed with the new data set, a document dump of Europeana with ca. 23 million documents. The most prominent issues that arose from pretests with this test collection were the very unspecific topics and the sparse document representations. Only about half of the topics (26/50) contained a description, and the titles were usually short, with only around two words. We therefore focused on three different term suggestion and query expansion mechanisms to overcome the sparse topical descriptions: two methods that build on concept extraction from Wikipedia and one method that applies co-occurrence statistics to the available Europeana corpus. In this paper we present the approaches and preliminary results from their assessments.
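
The abstract does not say which co-occurrence measure was used, so the following is only a minimal sketch of the general idea of co-occurrence-based term suggestion, not the authors' implementation. It assumes document-level co-occurrence scored with the Jaccard index, whitespace tokenization, and a small in-memory corpus; the function names `build_cooccurrence`, `suggest_terms`, and `expand_query` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(docs):
    """Count document frequency per term and per unordered term pair."""
    term_df = Counter()
    pair_df = Counter()
    for doc in docs:
        # Assumption: lowercase whitespace tokens; each term counts once per document.
        terms = sorted(set(doc.lower().split()))
        term_df.update(terms)
        pair_df.update(combinations(terms, 2))
    return term_df, pair_df


def suggest_terms(query_term, term_df, pair_df, k=3):
    """Rank terms that co-occur with query_term by Jaccard similarity."""
    query_term = query_term.lower()
    scores = {}
    for (a, b), both in pair_df.items():
        if query_term not in (a, b):
            continue
        other = b if a == query_term else a
        # Jaccard index: documents with both terms / documents with either term.
        union = term_df[a] + term_df[b] - both
        scores[other] = both / union
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def expand_query(query, term_df, pair_df, k=2):
    """Append up to k suggested terms per original query term."""
    original = query.lower().split()
    expanded = list(original)
    for term in original:
        for candidate in suggest_terms(term, term_df, pair_df, k):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded
```

With a two-word topic title, a sketch like this would add a handful of corpus-derived terms to the query before retrieval. On a collection of roughly 23 million documents the pair counts would have to be computed offline or restricted to frequent terms, which this toy version does not attempt.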

Similar resources

A New Document Embedding Method for News Classification

Abstract- Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. There is lots of news spreading on the web. A text classifier can categorize news automatically and this facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researches to use...

Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec

Distributed dense word vectors have been shown to be effective at capturing tokenlevel semantic and syntactic regularities in language, while topic models can form interpretable representations over documents. In this work, we describe lda2vec, a model that learns dense word vectors jointly with Dirichlet-distributed latent document-level mixtures of topic vectors. In contrast to continuous den...

MedLDA: maximum margin supervised topic models

A supervised topic model can use side information such as ratings or labels associated with documents or images to discover more predictive low dimensional topical representations of the data. However, existing supervised topic models predominantly employ likelihood-driven objective functions for learning and inference, leaving the popular and potentially powerful max-margin principle unexploit...

CEA LIST's Participation at the CLEF CHiC 2013

For our first participation to the CLEF CHiC Lab, we submitted runs to the multilingual ad-hoc and multilingual semantic enrichment tasks. Given the strong multilingual character of the evaluation corpus, the main objectives of the experiments were to test the efficiency of semantic topic expansion and consolidation based on Explicit Semantic Analysis (ESA) versions in different languages. Anot...


Journal title:
  • CoRR

Volume abs/1208.3952

Publication date 2012